{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d3bc6070-78e2-4074-95c5-476a163bff97",
   "metadata": {},
   "source": [
    "# GitHub Issue Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "464dd9fc-461c-4e20-a2a4-9d571e00a762",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a17672c-7d5b-454b-afd3-b83854b75f8c",
   "metadata": {},
   "source": [
    "To use the GitHub repository issue loader, set your GitHub personal access token in the `GITHUB_TOKEN` environment variable.  \n",
    "\n",
    "See the [GitHub documentation](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens) for how to create a token.  \n",
    "See [llama-hub](https://llama-hub-ui.vercel.app/l/github_repo_issues) for more details about the loader."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4d8d6d4-bb7c-4b96-88e7-5371701399ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"GITHUB_TOKEN\"] = \"<your github token>\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a18130fe-7ec7-491b-b020-15302f0114a1",
   "metadata": {},
   "source": [
    "## Load GitHub Issue Tickets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6d7b782d-21e2-4793-a6d5-fccfb41898c1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 100 issues in the repo page 1\n",
      "Resulted in 100 documents\n",
      "Found 100 issues in the repo page 2\n",
      "Resulted in 200 documents\n",
      "Found 100 issues in the repo page 3\n",
      "Resulted in 300 documents\n",
      "Found 100 issues in the repo page 4\n",
      "Resulted in 400 documents\n",
      "Found 4 issues in the repo page 5\n",
      "Resulted in 404 documents\n",
      "No more issues found, stopping\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "from llama_hub.github_repo_issues import (\n",
    "    GitHubRepositoryIssuesReader,\n",
    "    GitHubIssuesClient,\n",
    ")\n",
    "\n",
    "github_client = GitHubIssuesClient()\n",
    "loader = GitHubRepositoryIssuesReader(\n",
    "    github_client,\n",
    "    owner=\"jerryjliu\",\n",
    "    repo=\"llama_index\",\n",
    "    verbose=True,\n",
    ")\n",
    "\n",
    "docs = loader.load_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b3fa69a-0ad3-4dbe-b9f8-7bd72a7ab430",
   "metadata": {},
   "source": [
    "Quickly inspect the text and metadata of one loaded issue:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a3a33ec8-ac00-4624-b2da-3bd524632822",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"feat(context length): QnA Summarization as a relevant information extractor\\n### Feature Description\\r\\n\\r\\nSummarizer can help in cases where the information is evenly distributed in the document i.e. a large amount of context is required but the language is verbose or there are many irrelevant details. Summarization specific to the query can help.\\r\\n\\r\\nEither cheap local model or even LLM are options; the latter for reducing latency due to large context window in RAG. \\r\\n\\r\\nAnother place where it helps is that percentile and top_k don't account for variable information density. (However, this may be solved with inter-node sub-node reranking). \\r\\n\""
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[10].text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b3b9ade-1d1d-48d9-beb6-f0a9579a16fc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'state': 'open',\n",
       " 'created_at': '2023-07-13T11:16:30Z',\n",
       " 'url': 'https://api.github.com/repos/jerryjliu/llama_index/issues/6889',\n",
       " 'source': 'https://github.com/jerryjliu/llama_index/issues/6889'}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[10].metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "acdf5bce-d5af-4394-9b04-38f414a91fd6",
   "metadata": {},
   "source": [
    "## Extract themes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aab1255c-7a73-48f6-b6f5-7a0c3a6201b9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The autoreload extension is already loaded. To reload it, use:\n",
      "  %reload_ext autoreload\n"
     ]
    }
   ],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0eb64536-8a94-44ba-865f-75169a7f46e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pydantic import BaseModel\n",
    "from typing import List\n",
    "import asyncio\n",
    "\n",
    "\n",
    "from llama_index.program import OpenAIPydanticProgram\n",
    "from llama_index.llms import OpenAI\n",
    "from llama_index.async_utils import batch_gather"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31cc8987-34d6-4922-9b18-1d0458afa702",
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt_template_str = \"\"\"\\\n",
    "Here is a GitHub Issue ticket.\n",
    "\n",
    "{ticket}\n",
    "\n",
    "Please extract central themes and output a list of tags.\\\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "433d2f49-790a-4b37-a3e6-8c906c87e645",
   "metadata": {},
   "outputs": [],
   "source": [
    "class TagList(BaseModel):\n",
    "    \"\"\"A list of tags corresponding to central themes of an issue.\"\"\"\n",
    "\n",
    "    tags: List[str]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a024ff9e-2f2d-41d6-acc3-8506a68f8e17",
   "metadata": {},
   "outputs": [],
   "source": [
    "program = OpenAIPydanticProgram.from_defaults(\n",
    "    prompt_template_str=prompt_template_str,\n",
    "    output_cls=TagList,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9be86519-6d3d-4890-bec7-52445159677b",
   "metadata": {},
   "outputs": [],
   "source": [
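    "# Create one extraction coroutine per issue, passing only the issue text into the prompt.\n",
    "tasks = [program.acall(ticket=doc.text) for doc in docs]"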
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2092d673-8e72-4d7f-8305-0a1afdacfc09",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run the extraction tasks in batches of 10; the result is reused below as `tag_lists`.\n",
    "tag_lists = await batch_gather(tasks, batch_size=10, verbose=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f05ee26-5498-4113-9b02-34fb8f576f1d",
   "metadata": {},
   "source": [
    "## [Optional] Save/Load Extracted Themes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43e77c77-a2ab-4861-b8c7-d40fb6b334d9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c7df8871-a6ce-46f7-a7bc-b3900230f7b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"github_issue_analysis_data.pkl\", \"wb\") as f:\n",
    "    pickle.dump(tag_lists, f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9164228b-11ec-428e-b72f-862241d5c7f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"github_issue_analysis_data.pkl\", \"rb\") as f:\n",
    "    tag_lists = pickle.load(f)\n",
    "    print(f\"Loaded tag lists for {len(tag_lists)} tickets\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dff845af-0cba-4671-8fc1-e0922cc7b77c",
   "metadata": {},
   "source": [
    "## Summarize Themes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d8f26d1-7d5a-42ac-b5fe-8b69c054e0f3",
   "metadata": {},
   "source": [
    "Build the prompt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "78a439ee-daa6-445c-ba7f-56d57384d299",
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt = \"\"\"\n",
    "Here is a list of central themes (in the form of tags) extracted from a list of GitHub Issue tickets.\n",
    "Tags for each ticket are separated by two newlines.\n",
    "\n",
    "{tag_lists_str}\n",
    "\n",
    "Please summarize the key takeaways and what we should prioritize to fix.\n",
    "\"\"\"\n",
    "\n",
    "tag_lists_str = \"\\n\\n\".join([str(tag_list) for tag_list in tag_lists])\n",
    "\n",
    "prompt = prompt.format(tag_lists_str=tag_lists_str)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6cec58c3-0435-409e-aa4e-ff54fd767ef4",
   "metadata": {},
   "source": [
    "Summarize with GPT-4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2ff22f7-5357-4eea-861b-fc7c0ca02fe7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.llms import OpenAI\n",
    "\n",
    "response = OpenAI(model=\"gpt-4\").stream_complete(prompt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58285f95-a6d7-4af0-9d83-ddcfcd78bbf7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. Bug Fixes: There are numerous bugs reported across different components such as 'Updating/Refreshing documents', 'Supabase Vector Store', 'Parsing', 'Qdrant', 'LLM event', 'Service context', 'Chroma db', 'Markdown Reader', 'Search_params', 'Index_params', 'MilvusVectorStore', 'SentenceSplitter', 'Embedding timeouts', 'PGVectorStore', 'NotionPageReader', 'VectorIndexRetriever', 'Knowledge Graph', 'LLM content', and 'Query engine'. These issues need to be prioritized and resolved to ensure smooth functioning of the system.\n",
      "\n",
      "2. Feature Requests: There are several feature requests like 'QnA Summarization', 'BEIR evaluation', 'Cross-Node Ranking', 'Node content', 'PruningMode', 'RelevanceMode', 'Local-model defaults', 'Dynamically selecting from multiple prompts', 'Human-In-The-Loop Multistep Query', 'Explore Tree-of-Thought', 'Postprocessing', 'Relevant Section Extraction', 'Original Source Reconstruction', 'Varied Latency in Retrieval', and 'MLFlow'. These features can enhance the capabilities of the system and should be considered for future development.\n",
      "\n",
      "3. Code Refactoring and Testing: There are mentions of code refactoring, testing, and code review. This indicates a need for improving code quality and ensuring robustness through comprehensive testing.\n",
      "\n",
      "4. Documentation: There are several mentions of documentation updates, indicating a need for better documentation to help users understand and use the system effectively.\n",
      "\n",
      "5. Integration: There are mentions of integration with other systems like 'BEIR', 'Langflow', 'Hugging Face', 'OpenAI', 'DynamoDB', and 'CometML'. This suggests a need for better interoperability with other systems.\n",
      "\n",
      "6. Performance and Efficiency: There are mentions of 'Parallelize sync APIs', 'Average query time', 'Efficiency', 'Upgrade', and 'Execution Plan'. This indicates a need for improving the performance and efficiency of the system.\n",
      "\n",
      "7. User Experience (UX): There are mentions of 'UX', 'Varied Latency in Retrieval', and 'Human-In-The-Loop Multistep Query'. This suggests a need for improving the user experience.\n",
      "\n",
      "8. Error Handling: There are several mentions of error handling, indicating a need for better error handling mechanisms to ensure system robustness.\n",
      "\n",
      "9. Authentication: There are mentions of 'authentication' and 'API key', indicating a need for secure access mechanisms.\n",
      "\n",
      "10. Multilingual Support: There is a mention of 'LLM中文应用交流微信群', indicating a need for multilingual support."
     ]
    }
   ],
   "source": [
    "for r in response:\n",
    "    print(r.delta, end=\"\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
